ABSTRACT

Given a directed acyclic graph with labeled vertices, we 
consider the problem of finding the most common label se- 
quences ("traces") among all paths in the graph (of some 
maximum length m). Since the number of paths can be 
huge, we propose novel algorithms whose time complexity 
depends only on the size of the graph, and on the relative 
frequency $\epsilon$ of the most frequent traces. In addition, we
apply techniques from streaming algorithms to achieve space
usage that depends only on $\epsilon$, and not on the number of
distinct traces.

The abstract problem considered models a variety of tasks 
concerning finding frequent patterns in event sequences. Our 
motivation comes from working with a data set of 2 million 
RFID readings from baggage trolleys at Copenhagen Air- 
port. The question of finding frequent passenger movement 
patterns is mapped to the above problem. We report on 
experimental findings for this data set. 

1. INTRODUCTION 

Sequential pattern mining has attracted a lot of interest in 
recent years. However, some of the probabilistic techniques
that have proven their efficiency in mining of frequent itemsets
have, to the best of our knowledge, not been transferred to the
realm of sequence mining. The aim of this paper is to take
a step in that direction: we propose an analogue
of Toivonen's sampling-based algorithm for frequent itemset
mining [15] in the context of sequential patterns.



At a conceptual level we work with a new, simple formulation
of the problem: The input is a directed acyclic graph
(DAG) where the vertices are events and there is an edge between
two events if they are considered to be connected (i.e.,
part of the same event sequences). Vertices are labeled by
the type of event they represent. This allows a certain flexibility
in modeling that is lacking in many other formulations:



• Spatio-temporal events can be connected based on both
spatial and temporal closeness.

• Events that have an associated time range (rather than
a single time stamp) can be connected based on an
arbitrary closeness criterion.



The data mining task we consider is to find the most com- 
mon sequences of event types ("traces") among all paths in 
the DAG, or more generally all paths of some maximum 
length m. The challenge is to handle the huge number of 
paths that may be present in a DAG. Our approach rests 
on a novel sampling procedure that is able to create a sam- 
ple of any desired size, in time that is linear in the size of 
the DAG (for preprocessing) and the size of the sample (for 
sampling). This allows a time complexity for the mining procedure
that depends on the relative frequency $\epsilon$ of the most
common traces rather than the total number of traces. We
also apply a technique from data streaming algorithms to
achieve space that depends on $\epsilon$ rather than on the number
of distinct traces.

Though our formulation does not capture all the many as- 
pects present in other approaches to sequential pattern min- 
ing, we believe that it possesses an attractive combination 
of expressive modeling and algorithmic tractability. 

1.1 Problem definition 

We are given a directed acyclic graph $G = (V, E)$, and a
function $\mathrm{label}: V \to L$ that maps vertices to their labels. A
path $p$ in $G$ is a sequence of vertices $v_1, v_2, \ldots, v_j \in V$ such
that $(v_i, v_{i+1}) \in E$ for $i = 1, \ldots, j-1$. A path $p$ has a trace
$\mathrm{label}(p)$, which is the vector of labels on the path. Let $S_m$
denote the multiset of all path traces of length at most $m$,
i.e.,

$S_m = \{ \mathrm{label}(p) \mid p \text{ is a path in } G \text{ of length at most } m \}$.

The data mining task is to find the most frequent traces in
$S_m$. It comes in several flavors:



• Top-k. For a parameter $k$, find the $k$ traces that have
the most occurrences in $S_m$ (breaking ties arbitrarily).

• Frequency $\epsilon$. Find the set of traces that have relative
frequency $\epsilon$ or more in $S_m$.

• Monte Carlo. For both the above variants we can
allow an error probability $\delta$ (typically allowing a false
negative probability, i.e., that we fail to report a trace
with probability $\delta$).

In this paper the emphasis will be on Monte Carlo algorithms
for the frequency variant. However, we note that one can also
obtain results for top-k by a simple reduction.
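
As a small illustration of the first two flavors, the following Python sketch computes them from a multiset of traces that is given explicitly. The function names `top_k` and `frequent`, and the representation of traces as tuples of labels, are assumptions made here for illustration; the sketch ignores the graph and simply counts.

```python
from collections import Counter

def top_k(traces, k):
    """Return the k most common traces with their counts (ties broken arbitrarily)."""
    return Counter(traces).most_common(k)

def frequent(traces, eps):
    """Return {trace: relative frequency} for traces with relative frequency >= eps."""
    counts = Counter(traces)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items() if c / total >= eps}

# Toy multiset of five traces over the labels 'a' and 'b' (hypothetical data).
S2 = [("a", "a"), ("a", "a"), ("a", "b"), ("b", "a"), ("b", "b")]
```

Here `frequent(S2, 0.4)` keeps only the trace `("a", "a")`, which makes up 2 of the 5 traces.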



1: procedure AllTraces(v, t, i)
2:     if i > 0 then
3:         output t || label(v)
4:         for each v' where (v, v') ∈ E do
5:             AllTraces(v', t || label(v), i - 1)
6:         end for
7:     end if
8: end procedure
9: for v ∈ V do
10:     AllTraces(v, e, m)
11: end for



1.2 Related work 

There is a large body of related work on sequence data mining,
see e.g. (fe)[14][8][6][18][5][2][13]. These works deviate
from the present one in that they consider the input as a 
sequence of timestamped events, and allow a host of for- 
mulations of what kinds of subsequences are of interest. In 
contrast, we put the modeling of interesting subsequences 
into the description of the event sequence (by defining DAG 
edges), and the patterns sought are simple strings. This al- 
lows us to do things that we believe have not been done, and 
are probably difficult, in traditional sequential data mining 
settings, namely making use of sampling methods. The dif- 
ficulty with sampling is, of course, that patterns can overlap 
in complicated ways, so any straightforward approach (such 
as sampling nodes or edges) will fail to give independent 
samples. 

Another related area is algorithms for finding frequent subgraphs
in graphs, see e.g. [17][7][11][4]. Indeed, the problem
we consider can be seen as that of finding frequent (labeled) 
paths in an acyclic graph. Our work deviates from previous 
works mainly in that we consider directed acyclic graphs 
rather than general (undirected) graphs. This allows us to 
present algorithms with provable upper bounds on space us- 
age and running time. No such efficient bounds are possible 
for general graphs: Even the problem of determining if a
graph contains a simple path of length k requires time exponential
in k [1][19], and this is inevitable assuming the Hamiltonian
cycle problem requires exponential time in the number
of vertices (a well-established hypothesis). In addition, we
believe that this is the first use of sampling methods in the 
context of finding frequent subgraphs. Possibly, this could 
inspire further work on using sampling in graph mining. 

2. OUR SOLUTION 

2.1 Generation of all traces 

As a warmup we consider the task of producing the multiset
of all traces having maximum length $m$. We will use the
notation $S_i(v)$ to denote the multiset of traces corresponding
to paths (of length at most $i$) starting in node $v$. Clearly
$S_0(v) = \emptyset$. For $i > 0$ we have the recursive definition

$S_i(v) = \{\mathrm{label}(v)\} \times \Big( e \cup \bigcup_{(v,v') \in E} S_{i-1}(v') \Big)$,

where $e$ denotes the empty trace, and $\bigcup$ is multiset union.
Clearly we have $S_m = \bigcup_{v \in V} S_m(v)$.

These equalities lead to a simple recursive algorithm, shown
in Figure 1. It is easy to see that if traces are represented
in a reasonable way (e.g. as singly linked lists) the running
time is linear in the size $|V| + |E|$ of the graph and the total
length of the traces generated.

Figure 1: The procedure AllTraces outputs the
concatenation of a trace prefix t, and each trace
starting at v having length at most i. The notation
|| is for concatenation of traces. Lines 9-11 call
AllTraces for all vertices v, with the empty trace e
as prefix, producing the multiset $S_m$ of all traces of
length at most m.
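
The recursion of Figure 1 can be sketched in Python as follows. This is a minimal sketch under assumptions of our own: the graph is a dictionary `edges` mapping a vertex to its list of successors, `label` maps vertices to labels, traces are tuples, and the length of a path is counted in vertices. Tuple concatenation copies the prefix, so this version does not achieve the linear bound stated for linked lists.

```python
from collections import Counter

def all_traces(edges, label, m):
    """Yield the trace of every path with at most m vertices (the multiset S_m)."""
    def rec(v, prefix, i):
        if i > 0:
            trace = prefix + (label[v],)      # t || label(v)
            yield trace
            for w in edges.get(v, ()):
                yield from rec(w, trace, i - 1)
    for v in label:
        yield from rec(v, (), m)              # start with the empty trace

# Hypothetical toy DAG: 1 -> 2 -> 3 and 1 -> 3.
edges = {1: [2, 3], 2: [3]}
label = {1: "a", 2: "b", 3: "a"}
counts = Counter(all_traces(edges, label, m=3))
```

On the toy DAG the seven paths give six distinct traces; the trace `("a",)` occurs twice because vertices 1 and 3 share a label.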

Succinct output. If we are satisfied with returning hash 
values of the traces (unique with high probability) the time 
can be improved such that only $O(1)$ time is used for each
trace, i.e. time $O(|V| + |E| + |S_m|)$ in total. This can be done
using a standard incremental string hashing method such as 
Karp-Rabin [9]. Observe that the output is sufficient to find 
the hash values of the most frequent traces in Sm (with a 
negligible error probability). A second run of the procedure 
could then output the actual frequent traces, e.g. by looking 
up the count of each hash value computed. 
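
The following Python sketch shows one way to extend a polynomial (Karp-Rabin style) hash by one label per recursive call, so that each trace costs a constant number of arithmetic operations. The modulus `P`, the base `B`, and the assumption that labels are small non-negative integers (`label_id`) are choices made for this illustration; in practice the base would be drawn at random.

```python
from collections import Counter

P = (1 << 61) - 1      # a Mersenne prime used as modulus (assumed choice)
B = 1_000_003          # hash base; fixed here only to keep the example reproducible

def trace_hashes(edges, label_id, m):
    """Yield one hash value per path with at most m vertices."""
    def rec(v, h, i):
        if i > 0:
            h = (h * B + label_id[v] + 1) % P   # extend the prefix hash by one label
            yield h
            for w in edges.get(v, ()):
                yield from rec(w, h, i - 1)
    for v in label_id:
        yield from rec(v, 0, m)

# Hypothetical toy DAG: 1 -> 2 -> 3 and 1 -> 3, with integer labels.
edges = {1: [2, 3], 2: [3]}
label_id = {1: 0, 2: 1, 3: 0}
hash_counts = Counter(trace_hashes(edges, label_id, m=3))
```

Counting hash values instead of traces keeps the per-trace cost constant; a second run can then map the frequent hash values back to actual traces.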

2.2 Generation of a random sample 

If the patterns we are interested in occur many times, sub- 
stantial savings in time can be obtained by employing a sam- 
pling procedure. That is, rather than generating Sm explic- 
itly we are interested in an algorithm that produces each 
trace in Sm with a given probability p, independently. This 
will reduce the expected number of samples to a fraction p of 
the original. The choice of p is constrained by the fact that 
we still want to sample each frequent trace a fair number of 
times (to minimize the probability of false negatives being 
introduced by the sampling). 



Counting phase. Our algorithm starts by computing, for
$i = 1, \ldots, m$, the number of paths v.c[i] of length at most
$i$ that start in each vertex $v$. We assume that this can be
done using standard precision (e.g. 64 bit) integers. The algorithm
shown in Figure 2 mimics the structure of the naive
generation algorithm, but uses memoization (a.k.a. dynamic
programming) to reduce the running time.

For each $i \le m$ the cost of all calls to CountTraces with
parameters $(v, i)$, disregarding the cost of recursive calls, is
easily seen to be proportional to the number of edges incident
to $v$. This means that the total time complexity of the
counting phase is $O(|E|m)$. The space usage is dominated
by an array of size $m$ for each vertex, i.e., it is $O(|V|m)$.



function CountTraces(v, i)
    if v.c[i] = null then
        v.c[i] ← 1
        for each v' where (v, v') ∈ E do
            v.c[i] ← v.c[i] + CountTraces(v', i - 1)
        end for
    end if
    return v.c[i]
end function

for v ∈ V do
    CountTraces(v, m)
end for



Figure 2: Recursive computation of the number of
traces for each starting vertex, using memoization.
It assumes that each value v.c[0] is initially set to
zero, and each value v.c[i], 0 < i ≤ m, is initially null.
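
A Python rendering of the counting phase of Figure 2 might look as follows. It is a sketch under the assumption that the graph is given as a dictionary `edges` from a vertex to its successors; the table `c` plays the role of the fields v.c[i]. Python integers are unbounded, so the 64-bit assumption is not needed here.

```python
def count_traces(edges, vertices, m):
    """Return c with c[v][i] = number of paths with at most i vertices starting at v."""
    c = {v: [0] + [None] * m for v in vertices}   # c[v][0] = 0, other entries unknown

    def rec(v, i):
        if c[v][i] is None:
            total = 1                              # the single-vertex path (v)
            for w in edges.get(v, ()):
                total += rec(w, i - 1)
            c[v][i] = total
        return c[v][i]

    for v in vertices:
        rec(v, m)
    return c

# Hypothetical toy DAG: 1 -> 2 -> 3 and 1 -> 3.
edges = {1: [2, 3], 2: [3]}
c = count_traces(edges, [1, 2, 3], m=3)
total_traces = sum(c[v][3] for v in (1, 2, 3))
```

As in the pseudocode, only the entries that are actually reached by the recursion are filled in; the sum of c[v][m] over all vertices gives the size of $S_m$.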



1: procedure SampleTraces(v, t, i)
2:     out ← false
3:     for each v' where (v, v') ∈ E do
4:         if rand() > (1 - p)^(v'.c[i-1]) / (1 - (1 - p)^(v.c[i])) then
5:             SampleTraces(v', t || label(v), i - 1)
6:             out ← true
7:         end if
8:     end for
9:     if out = false or rand() < p then
10:         output t || label(v)
11:     end if
12: end procedure
13: for v ∈ V do
14:     if rand() > (1 - p)^(v.c[m]) then
15:         SampleTraces(v, e, m)
16:     end if
17: end for



Sampling phase. Consider the multiset $S_i(v)$ of traces,
which has size v.c[i] by definition. The probability that none
of these traces are sampled should be $(1-p)^{v.c[i]}$. Conditioned
on the event that at least one trace from $S_i(v)$ is
sampled, we either have to sample a trace of length more
than one (starting with label(v)), or include the single-vertex
trace label(v) in the sample. In a nutshell, this is what the procedure
SampleTraces of Figure 3 does.

Let rand() denote a function that returns a uniformly random
number in [0; 1], independently of previously returned
values. The condition rand() $> (1-p)^{v.c[m]}$ holds with probability
$1-(1-p)^{v.c[m]}$, so lines 14-16 call SampleTraces if
and only if we need to sample at least one trace from $S_m(v)$.
In the procedure SampleTraces we use, similarly to above,
a parameter t to pass along a trace prefix. The variable out
is used to keep track of whether a trace has been output in
the recursive calls. If out is false after all recursive calls we
sample t || label(v). For each v' with $(v,v') \in E$ the probability
that we do not sample any trace from label(v) || $S_{i-1}(v')$
is $(1-p)^{v'.c[i-1]} / (1-(1-p)^{v.c[i]})$. This is exactly the correct
probability since we condition on at least one trace in $S_i(v)$
being sampled.
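
To make the idea of sampling conditioned on a non-empty result concrete, here is a hedged Python sketch. It is not a transcription of Figure 3: it splits $S_i(v)$ into groups (the single-vertex trace, and one group per successor) and conditions sequentially, i.e., as long as no group has been hit, the probability of hitting the current group is divided by the probability that at least one of the remaining groups is hit. The names, the dictionary-based graph, and this way of computing the conditional probabilities are assumptions of the sketch.

```python
import random

def sample_traces(edges, label, m, p, rng=random):
    """Return a list of traces; each trace of S_m is included independently w.p. p."""
    q = 1.0 - p
    memo = {}

    def count(v, i):                        # |S_i(v)|, memoized as in the counting phase
        if i == 0:
            return 0
        if (v, i) not in memo:
            memo[(v, i)] = 1 + sum(count(w, i - 1) for w in edges.get(v, ()))
        return memo[(v, i)]

    out = []

    def at_least_one(v, prefix, i):
        """Sample from S_i(v), conditioned on the sample being non-empty."""
        trace = prefix + (label[v],)
        # Groups of S_i(v): the single-vertex trace, then one group per successor.
        groups = [(None, 1)] + [(w, count(w, i - 1)) for w in edges.get(v, ())]
        remaining = sum(n for _, n in groups)
        hit = False                         # has an earlier group been sampled?
        for w, n in groups:
            if n == 0:
                continue
            p_group = 1.0 - q ** n          # unconditional P[group is non-empty]
            if not hit:                     # condition on a hit among remaining groups
                p_group /= 1.0 - q ** remaining
            remaining -= n
            if rng.random() < p_group:
                hit = True
                if w is None:
                    out.append(trace)
                else:
                    at_least_one(w, trace, i - 1)

    for v in label:
        if rng.random() < 1.0 - q ** count(v, m):
            at_least_one(v, (), m)
    return out

# Hypothetical toy DAG: 1 -> 2 -> 3 and 1 -> 3.
edges = {1: [2, 3], 2: [3]}
label = {1: "a", 2: "b", 3: "a"}
```

With p = 1 the sketch returns all of $S_m$, and with p = 0 it returns nothing. For very small p the expression 1 - q ** n loses precision, so a careful implementation would use log1p/expm1 instead.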

Refinement. Observe that the probability in line 4 may be
precomputed for each edge and value of i. Even with this
optimization, a direct implementation of the pseudocode in
Figure 3 may spend a lot of time in the for loop of SampleTraces
without producing any output. To get a theoretically
satisfying solution we may preprocess, for each (v, i),
the probabilities $p_1, p_2, \ldots, p_d$ of making the recursive calls.
Specifically, for $j = 0, \ldots, d$ we consider the probabilities
$q_j = \prod_{j' \le j} (1 - p_{j'})$ that no recursive call is made in the first
$j$ iterations. If we choose $r$ uniformly at random in [0; 1] then
the probability that $q_{j-1} > r > q_j$ is exactly the probability
that the first recursive call is in the $j$th iteration. Similarly,
the probability that $r < q_d$ is exactly the probability that no
recursive call is made. Thus, by doing a binary search for $r$
over $q_d, \ldots, q_0$ we may choose, with the correct probability,
the first iteration $j_1$ in which there should be a recursive
call. The same method can be repeated, using a random
value $r$ in $[0; q_{j_1}]$ to find the next recursive call, and so on.
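
The binary search over the prefix products can be sketched as follows. The function names are hypothetical, the probabilities are assumed to be strictly below 1 (so that no $q_j$ is zero), and the standard `bisect` module is used on the reversed, ascending array.

```python
import bisect
import random

def prefix_products(probs):
    """q[j] = prod_{j' <= j} (1 - probs[j'-1]) for j = 0..d, so q[0] = 1."""
    q = [1.0]
    for pj in probs:
        if not 0.0 <= pj < 1.0:
            raise ValueError("probabilities are assumed to lie in [0, 1)")
        q.append(q[-1] * (1.0 - pj))
    return q

def first_call_after(asc, d, last, r):
    """Smallest j in (last, d] with q[j] <= r, or None; asc is q reversed."""
    k = bisect.bisect_right(asc, r, 0, d - last)   # how many such j exist
    return None if k == 0 else d + 1 - k

def calls(probs, rng=random):
    """Iterations (1-based) that make a recursive call, each chosen independently."""
    q = prefix_products(probs)
    asc, d = q[::-1], len(probs)
    chosen, last = [], 0
    while True:
        r = rng.random() * q[last]                 # uniform below q[last]
        j = first_call_after(asc, d, last, r)
        if j is None:
            return chosen
        chosen.append(j)
        last = j
```

For `probs = [0.5, 0.5]` the prefix products are 1, 0.5, 0.25, so a value r = 0.7 selects iteration 1, r = 0.3 selects iteration 2, and r = 0.1 selects no call at all.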



Figure 3: The procedure SampleTraces outputs the
concatenation of a trace prefix t and a random sample
of the traces starting at v of length at most i.
The traces are sampled from the conditional distribution
that is guaranteed to sample at least one
trace. As before, the notation || is for concatenation
of traces, and e denotes the empty trace. Lines 13-17
call SampleTraces for each vertex v with probability
$1-(1-p)^{v.c[m]}$, to produce a sample of all traces starting
at v having length at most m, where each trace is
chosen independently at random with probability p.



In the worst case this uses time $\Theta(\log |V|)$ per recursive
call. We can exploit the fact that we are searching for a
random value r to decrease this to $O(1)$ expected time. The
basic idea is to place the probabilities $q_j$ in buckets according
to the $\log d$ most significant bits, and furthermore store in
each bucket its predecessor (i.e., the largest value $q_j$ that
is smaller than all elements in the bucket). Given r, we
can find $j_1$ by inspecting the values in the bucket that r
belongs to (the elements, and their predecessor). This will
take expected time $O(1)$ since r is random and the average
number of values per bucket is 1.
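
A simplified Python sketch of the bucketing idea, covering only the first search (r uniform in [0; 1)), is given below. Using floor($q_j \cdot d$) as the bucket index is an assumption that stands in for "the $\log d$ most significant bits"; the function names are hypothetical.

```python
def build_buckets(q):
    """Index q[1..d] (non-increasing, values in [0, 1]) by bucket floor(q[j] * d)."""
    d = len(q) - 1
    buckets = [[] for _ in range(d + 1)]
    for j in range(1, d + 1):
        buckets[int(q[j] * d)].append(j)     # j ascending, so q descending in a bucket
    pred, best = [None] * (d + 1), None
    for b in range(d + 1):
        pred[b] = best                       # index of the largest value in lower buckets
        if buckets[b]:
            best = buckets[b][0]
    return buckets, pred

def first_call(q, buckets, pred, r):
    """Smallest j >= 1 with q[j] <= r, or None; r is assumed to lie in [0, 1)."""
    b = int(r * (len(q) - 1))
    for j in buckets[b]:                     # expected O(1) elements for random r
        if q[j] <= r:
            return j
    return pred[b]                           # otherwise the bucket's predecessor

# Example with the prefix products of probabilities (0.5, 0.5).
q = [1.0, 0.5, 0.25]
buckets, pred = build_buckets(q)
```

Values in higher buckets are larger than r and values in lower buckets are smaller, so only the bucket of r and its stored predecessor need to be inspected.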

To make this work not just for the first search, we adjust
the bucketing as follows: We partition $q_1, \ldots, q_d$ according
to the number of leading 0s in the binary representations
(w.l.o.g. there are $O(\log n)$, since we can rely on brute-force
search for low probability events, i.e., if r gets very small).
In each partition, containing $d'$ values, we partition the values
in buckets according to the $\log d'$ most significant bits.
As before, we store the predecessor of each bucket. It is
clear that this data structure requires $O(d)$ space, and can
be constructed in time $O(d)$. A search for random r in
$[0; \gamma]$ happens in the structure corresponding to the number
of leading 0s in $\gamma$. This will choose a random bucket of expected
size $O(1)$, and the analysis finishes as before. If there
are no $q_j$ values with the right number of leading 0s, we use
a special structure of $O(\log n)$ bits to find the partition of
the predecessor in $O(1)$ time.

As before, we can choose to have a succinct output where
traces are represented by their hash values,
with no increase in time complexity.



2.3 Time and error analysis 

For the time analysis we focus on the refined implementation
described above, since it allows a clean and exact theoretical
analysis. A similar analysis of the version stated in the
pseudocode can be made under the assumption that the outdegree
of vertices in G is bounded by a constant. Observe
that if SampleTraces makes c recursive calls this takes expected
time $O(1 + c)$. Also observe that the total number of
procedure calls is upper bounded by the total length of all
sampled traces. This is because each recursive call is guaranteed
to produce at least one output. Combining these
facts we see that the expected time for all calls to SampleTraces
is linear in the length $l$ of all traces sampled.
Notice that the expected value of $l$ is $O(p|S_m|m)$. Since $l$ is
independent of the random choices determining the running
time of the data structure in the refined implementation we
can conclude that the total expected running time of the
code in Figures 2 and 3 is $O(|V| + |E|m + p|S_m|m)$.

The parameter $p$ must be chosen such that $p = C/(\epsilon |S_m|)$, where
$C > 1$ is a parameter that determines the false negative
probability. The expected number of times that we sample
a trace with frequency $\epsilon'$ is $C\epsilon'/\epsilon$, and since the samples
are independent the number of samples follows a binomial
distribution. By Chernoff bounds, this means that if $\epsilon' \ge \epsilon$
then the number of samples is at least $C/2$ with probability
$1 - 2^{-\Omega(C)}$. Concrete error probabilities for $C = 10$ are
discussed in our experimental section. We have the following
theoretical result:



Theorem 1. We can generate a random sample of $S_m$
in expected time $O(|V| + |E|m + \log(1/\delta)/\epsilon)$ such that each
trace with frequency $\epsilon$ or more has frequency at least $\epsilon/2$ in
the random sample with probability $1 - \delta$.

Observe that the running time is independent of the total
number of traces in $S_m$.

2.4 Putting things together

It remains to assess how to choose, among the samples, the
ones that are actually interesting. In particular, we are interested
in those traces appearing in the sample at least $C/2$
times.

This problem can be efficiently addressed using a frequent
items algorithm. Such algorithms have been designed for use
in a data streaming context, and guarantee low space usage.
A comprehensive treatment and an experimental comparison
between various techniques can be found in [3]. The
problem itself dates back at least to the 1980s, and can be
formalized in this way:

Definition 2. Given a stream S of n elements and a
frequency threshold $\eta$, the frequent items problem asks for
the set T of items that occur at least $\eta$ times.

The algorithms addressing this problem usually solve a relaxed
version where a modest number of false positives can
appear in the output, since this reduces the space requirements
to $O(n/\eta)$. For completeness, we describe a concrete
frequent items implementation in Appendix A.

In order to solve the frequent items problem without false
positives, which in our case means without reporting traces
whose frequency is below $\epsilon$, we will make two passes, i.e.,
generate the sample twice and do exact counting of potentially
frequent items in the second pass. This will roughly
double the running time.

Lemma 3. Given a stream of elements representing the
set of samples of traces produced by SampleTraces, the
space needed in order to output the traces with frequency
at least $\epsilon$, without producing any trace with frequency less
than $\epsilon$, is $O(1/\epsilon)$ words.

Let freq(t, $S_m$) denote t's fraction of $S_m$ (viewed as a multiset).
E.g., if $S_2 = \{aa, aa, ab, ba, bb\}$ we have freq(aa, $S_2$) =
2/5. Putting together Theorem 1 and the above lemma, we
get:

Theorem 4. Let $\epsilon$ and $\delta$ be positive reals. In expected
time $O(|V| + |E|m + \log(1/\delta)/\epsilon)$ and space $O(1/\epsilon)$ we can
produce a set T of $O(1/\epsilon)$ traces, and accompanying random
variables $X_t$, $t \in T$, such that:

• For each t with freq(t, $S_m$) $\ge \epsilon$, $\Pr[t \in T] \ge 1 - \delta$, and

• for each $t \in T$, $X_t$ has binomial distribution with mean
freq(t, $S_m$) $f(\epsilon, \delta)$, where $f(\epsilon, \delta) = \Theta(\log(1/\delta)/\epsilon)$.

The first property says that the probability that a frequent
trace is not reported is at most $\delta$. The second property says
that the frequency of the traces in T can be estimated, with
strong statistical guarantees, since the $X_t$ values come from
a highly concentrated distribution with mean proportional
to freq(t, $S_m$).

3. FROM EVENT SEQUENCE TO A DAG 

An event sequence is a set S of tuples of the form $(t, i, \ell)$,
where $t \in \mathbb{R}$ is a time stamp, $i$ is a tag identifier, and $\ell$ is
a label (in our case, $\ell$ is a location identifier that indicates
an approximate location, namely vicinity of an antenna).
In this work we do not consider the physical locations of
antennas as part of the input.

Formally we may define the problem as follows: For a given
number A, the input set specifies a directed acyclic graph
$G_A = (V, E_A)$, where each observation is a vertex, and there
is an edge from $v_1$ to $v_2$ if and only if the vertices are observations
of the same tag, at different locations, with $v_2$ observed
at most A time units after $v_1$ (we use minutes as the time unit from
now on).

To produce the DAG we sort the data by tag ID and timestamp.
Note that this makes it easy to find all the edges
from a particular vertex v in $G_A$: Simply scan the sorted list
forward until either the timestamp differs by more than A
from that of v, or we reach a node corresponding to another
tag.

Example. If A = 20 and we observe locations 1, 2, 3, 6,
7 at times 10, 20, 30, 60, 70, the following subsequences are
considered to reflect a movement: 1-2, 2-3, 1-2-3, 1-3, 6-7.
Notice the inclusion of 1-3, where one observation is skipped,
since there are at most A minutes between the observations of
1 and 3.
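
The sort-and-scan construction and the example can be reproduced with the following Python sketch. The tuple layout `(time, tag, location)`, the function names, and the tag name `"T"` are assumptions made for illustration; vertices are identified by their position in the sorted list.

```python
def build_dag(events, max_gap):
    """events: iterable of (time, tag, location). Returns (vertices, edges)."""
    vertices = sorted(events, key=lambda e: (e[1], e[0]))   # by tag, then time
    edges = {i: [] for i in range(len(vertices))}
    for i, (t, tag, loc) in enumerate(vertices):
        for j in range(i + 1, len(vertices)):               # scan forward
            t2, tag2, loc2 = vertices[j]
            if tag2 != tag or t2 - t > max_gap:
                break                                       # other tag or too late
            if loc2 != loc:
                edges[i].append(j)
    return vertices, edges

def movements(vertices, edges):
    """Location sequences of all paths with at least two vertices."""
    def rec(i, prefix):
        path = prefix + (vertices[i][2],)
        if len(path) > 1:
            yield path
        for j in edges[i]:
            yield from rec(j, path)
    for i in edges:
        yield from rec(i, ())

# The example above: one tag seen at locations 1, 2, 3, 6, 7.
events = [(10, "T", 1), (20, "T", 2), (30, "T", 3), (60, "T", 6), (70, "T", 7)]
vertices, edges = build_dag(events, max_gap=20)
```

Enumerating `movements(vertices, edges)` yields exactly the five subsequences listed in the example, including 1-3.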

3.1 Converting the RFID data 

We have worked with a data set consisting of readings of 
RFID (Radio-Frequency ID) tags by fixed-position antennas.
RFID chips can be identified only when they are in the prox- 
imity of an antenna, which means that readings give approx- 
imate information about the location of an RFID tag. Such 
data sets, as well as similar data sets based on other tech- 
nologies, are becoming increasingly available as more and 
more items, from parcels to items in shops, are being tagged 
with RFID chips. 

Before using the RFID data to create a DAG, we
cleaned some of the noise present in the data. One source
of noise was the presence of sequences of readings regarding
trolleys remaining in zones where the ranges of two antennas
overlap. This gave rise to sequences of alternating
readings of the form $(x^+ y^+)(x^+ y^+)^+$. In order to clean up
these interferences, we replaced such sequences by a new
zone label that represents the zone of overlap of the ranges
of the antennas. In particular we have used, for a sequence
$(x^+ y^+)(x^+ y^+)^+$, the label $\min\{x, y\} \cdot 100 + \max\{x, y\}$. This
can be thought of as an increase in the spatial resolution of
the readings.
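
A possible reading of this cleaning step, restricted to the label sequence of a single trolley, is sketched below in Python. The sketch makes several assumptions that are not spelled out in the text: zone labels are integers below 100, matching is greedy from left to right, repeated readings of the same zone are first collapsed into runs, and timestamps are ignored.

```python
from itertools import groupby

def collapse_overlaps(labels):
    """Replace alternating runs x..y..x..y.. by the label min(x,y)*100 + max(x,y)."""
    runs = [k for k, _ in groupby(labels)]    # one entry per run of equal labels
    out, i = [], 0
    while i < len(runs):
        # An overlap needs at least two repetitions of the pair (x, y).
        if i + 3 < len(runs) and runs[i] == runs[i + 2] and runs[i + 1] == runs[i + 3]:
            x, y = runs[i], runs[i + 1]
            j = i
            while j + 1 < len(runs) and runs[j] == x and runs[j + 1] == y:
                j += 2                        # consume one more (x+ y+) block
            out.append(min(x, y) * 100 + max(x, y))
            i = j
        else:
            out.append(runs[i])
            i += 1
    return out
```

For instance, the readings 1, 1, 2, 1, 2, 2, 3 become the overlap zone 102 followed by zone 3, while 1, 2, 1 is left unchanged because the pair is not repeated.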

Another source of noise, sometimes connected with the one
just described, is the presence of sequences of readings regarding
the same zone for a given trolley. In order to avoid
having traces of the form $t = (V y y^+ W)$, where V and W
are sequences of readings, we considered only one occurrence
of y, properly managing the timestamps of the readings. In
particular this means that, assuming the difference in time
between any two consecutive y is within the threshold A,
in the DAG we put a directed edge $(v, y)$, $v \in V$, iff the
first occurrence of y after V occurred within time A from
v. Moreover we put a directed edge $(y, w)$, $w \in W$, iff w
happened within time A from the last reading of y in t.

4. EXPERIMENTS 

For the experiments we have used the RFID dataset de- 
scribed above. We have used this dataset since it suits quite 
well the needs of the abstract formulation of the problem, 
and is massive enough to be challenging for our algorithm. 
Moreover, we did not manage to find interesting, raw DAG
data. However, it would be of interest to try our algorithms 
on DAGs derived from other (publicly available) data sets. 

We ran a set of experiments on the data, in order to understand
how many patterns would have been generated for a
given A and a size m. Fig. 5 shows the size of the graph for
different values of A. We compared the obtained results with
the expected performance of our algorithm (from the theoretical
analysis). For space usage this gives a rather precise
idea about the savings that can be obtained. For time usage,
there is greater uncertainty, since the time is influenced
by the constant factors in the implementation (which again
depend on the hardware on which we run the experiments).
It would be of interest to investigate the performance of a
concrete, tuned implementation to see how close one can get
to the theoretical gains.

Figure 5: Size of the airport DAG for different values
of A. As can be seen all graphs are quite sparse,
and in fact many nodes have no outgoing edges. This
is due to a relatively low resolution in the data set.

Figure 6: Characteristics of the data for several combinations
of A and m. The third column, Tot. traces,
represents the total number of traces that would be
generated by the naive approach; the Dis. traces column
represents the number of distinct traces; the
top 100th column contains the frequency of the 100th
most frequent trace; the column ratio represents the
saving we would achieve using a frequency threshold
equal to the one represented in the top 100th column.

Fig. 6 reports some interesting characteristics of the data
when varying A and m. In particular the table contains
the number of traces generated, the frequency of the 100th
most frequent trace and the ratio between the space needed
in case of an exact computation and the space required when
our algorithm is used. Note that the space to represent the
DAG and the counts is not taken into account in this ratio.
The rationale for this is that as we consider longer event sequences
the space for the DAG representation is expected to
become negligible compared to the space needed for finding
the most common traces.

From the results of the test it is clear that great savings
can be achieved when the frequencies we are interested in
are not too low. In one case, nearly 3 orders of magnitude of
space can be saved using our approach.

Fig. 7 shows the number of samples we would take in expectation
when C = 10 is used. The table gives the flavor
of the saving in time that could be achieved with respect
to generating all the possible traces. It is worth noticing
that with C = 10 we would end up with a probability of
a false negative that is lower than 7% (this can be
seen by considering the probability that a Poisson random
variable with mean 10 or more has value less than 5). Here
we notice that the total number of traces is already 1-2 orders
of magnitude larger than the size of the DAG, so we
expect an improvement in running time of at least 1 order
of magnitude.
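
The Poisson tail mentioned above is easy to evaluate directly. The short Python sketch below (function and variable names are ours) computes it for mean exactly 10, which is the worst case among means of 10 or more.

```python
import math

def poisson_cdf(k, mean):
    """P[X <= k] for X ~ Poisson(mean)."""
    return math.exp(-mean) * sum(mean ** i / math.factorial(i) for i in range(k + 1))

below_5 = poisson_cdf(4, 10.0)      # P[X < 5]
at_most_5 = poisson_cdf(5, 10.0)    # P[X <= 5]
```

The probability of a value less than 5 is about 0.029, and the probability of a value of at most 5 is about 0.067; both are below the 7% bound.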

Figure 4: RFID antenna in Copenhagen Airport.

Figure 7: The ratio between the total number of
traces and the number of samples we would take
using C = 10. With this value of C, the probability
of having false negatives would be approximately 7%.

APPENDIX

A. A CONCRETE FREQUENT ITEMS IMPLEMENTATION

For completeness, we describe at a high level
one of the several frequent items algorithms existing in the
literature. The algorithm is presented in [10]. We are interested
in reporting the traces appearing at least C/2 times in the
sample. For this purpose we maintain a set of $2p|S_m|/C$ entries;
each entry contains the label of the trace and a counter.
Every time SampleTraces outputs a trace t, we look at the
set of entries and, depending on whether the trace is already
recorded in one of the entries or not, we take one of two
actions:

t appears in entry i: we add 1 to the counter associated with
the entry i;

t does not appear in any entry: if some entry is empty, we
store t there with counter 1; otherwise we decrease by 1 all the
counters, and if a counter reaches 0 we remove the corresponding
trace from its entry.

This algorithm is guaranteed to find all the traces with frequency
above the threshold C/2, but could return traces
with frequency below the threshold. In order to eliminate
these traces from the output, a second pass over the sample
is required to get exact occurrence counts. There are two
possible ways of doing this. One way is to generate exactly
the same sample again (using a pseudorandom generator
with the same seed, or simply by storing the random choices
made). The other way (which is what we analyze theoretically)
is to take a new, random sample and count exactly
the number of occurrences of those elements that were found
to be "possibly frequent" in the first sample. This increases
the probability of false negatives by almost a factor of 2, so
to compensate for this one needs to slightly increase C.
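
The counter-based scheme and the exact second pass can be sketched in Python as follows. This is a minimal sketch with hypothetical names; it assumes the sample is available as a list that can be read twice (corresponding to regenerating the same sample), and it uses the common variant in which a new item takes a free entry when one exists.

```python
from collections import Counter

def candidates(stream, k):
    """One pass with at most k counters; keeps every item occurring > n/(k+1) times."""
    counters = {}
    for t in stream:
        if t in counters:
            counters[t] += 1
        elif len(counters) < k:
            counters[t] = 1                   # a free entry is available
        else:
            for key in list(counters):        # decrement all, drop those reaching 0
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

def frequent_items(stream, k, threshold):
    """Two passes over a replayable stream: candidates, then exact counts."""
    cand = candidates(stream, k)
    exact = Counter(t for t in stream if t in cand)
    return {t: c for t, c in exact.items() if c >= threshold}

# Toy stream of ten items: 'a' occurs 5 times, 'b' 3 times, 'c' and 'd' once.
stream = list("aababcabad")
```

With two counters the first pass keeps 'a' and 'b' as candidates; the second pass with threshold 4 then reports only 'a', with its exact count of 5.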



